1. Field of the Invention
The disclosure generally relates to a method and apparatus for providing a document summary. More specifically, the disclosure relates to a method and apparatus for providing document summaries that cover a majority of the concepts discussed in the document while remaining within a length requirement provided by the user.
2. Description of Related Art
The problem of identifying the gist of a document is conventionally referred to as the text summarization or document summarization problem. Traditional document-summarization techniques focus on the central idea of the text. With the rapid explosion of content on the world wide web, particularly in online text collections, it has become useful to provide improved mechanisms for identifying the important information themes associated with a text document or a collection of documents.
Consider an automatic syndication feed (also known as an RSS feed or automatic feed) that contains users' comments about a specific movie. Summarization is necessary because there may be several hundred comments written on the movie. The challenge, however, is that these comments usually do not share a single central idea. Each comment often has a different focus because every user looks at the movie from a different angle. Some users comment on the scenery, some on the actors or the director, others on the plot itself, and so on. From the reader's perspective, going through all the reviews is a tedious and often annoying task. At the same time, it is useful, and often necessary, to get a quick idea of what other moviegoers think about the movie. This requires generating a summary that covers the different aspects of the comments written by the different users.
This scenario illustrates an important point: summarization has become significant in helping readers deal with the information explosion currently underway. Its significance is amplified by the fact that the above-discussed case applies to many other domains and applications, e.g., when someone is reading online comments and discussions following blogs, videos and news articles.
In any text summarization task, the problem becomes one of picking the right sentences from the original document so that these sentences capture different viewpoints. This requirement is further refined by the following two criteria: (a) Coverage: the summary should consist of sentences that span a large portion of the spectrum of aspects discussed in the document; and (b) Orthogonality: each sentence in the summary should capture a different aspect of the document, and the sentences should be as orthogonal to one another as possible.
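The two criteria can be made concrete with a small sketch. It is illustrative only and is not the method of the disclosure: it assumes each sentence has already been reduced to a set of aspect labels, and it uses Jaccard overlap as a stand-in measure of how non-orthogonal two sentences are. The function names (coverage, overlap, orthogonality) and the sample data are hypothetical.

```python
from itertools import combinations


def coverage(summary, all_aspects):
    """Fraction of the document's aspects touched by at least one summary sentence."""
    if not all_aspects:
        return 0.0
    covered = set().union(*summary)
    return len(covered & all_aspects) / len(all_aspects)


def overlap(a, b):
    """Jaccard overlap between two aspect sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def orthogonality(summary):
    """One minus the mean pairwise overlap; 1.0 means no two sentences share an aspect."""
    pairs = list(combinations(summary, 2))
    if not pairs:
        return 1.0
    return 1.0 - sum(overlap(a, b) for a, b in pairs) / len(pairs)


# Hypothetical movie comments, each reduced to the aspects it mentions.
comments = [
    {"scenery", "actors"},
    {"actors", "director"},
    {"plot"},
    {"scenery"},
]
all_aspects = set().union(*comments)

# Two candidate summaries built from the comments above.
summary_a = [comments[0], comments[1]]               # both mention the actors
summary_b = [comments[1], comments[2], comments[3]]  # no shared aspects
```

Under these assumptions, summary_b scores higher than summary_a on both criteria: it touches every aspect in the sample and its sentences share none, whereas summary_a misses the plot and repeats the actors.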
Conventional summarization methods do not address coverage and orthogonality. Therefore, there is a need for a method and apparatus for highlighting diverse aspects in a document.